/*******************************************************************************
SIT-D replication source code.

	[Script No: 1]
	[Last updated: June 4th, 2024]
	
	This script consolidates the four independent data sources that constitute
	Sit-D Administrative dataset. The data sources are: 
	
		(A) Sit-D programatic data
			This dataset include the information of 2,070 Sit-D officers,
			gather at baseline. It covers information about their study status
			(Sit-D or control), demographics information, and patrol assignment
			at the time of randomization
		
		(B) Randomized training dates
			This dataset provides the cross walk to identify whether 
			a given month belong to the pre-training, training, or post-training
		
		(C) CPD data
			This dataset contains all administrative outcomes for the officers,
			covering the period between 01/2018 to 04/2022. It forms the basic
			of our administrative analysis
			
		(D) Census data															***
			Census data aggreagated at the district level. 
			
		
	The output of this script is a dataset at officer-month level, containing
	all the measurements for administrative analysis except for indexs.
	
	Index variables are created in the next code [Script No. 2]

*******************************************************************************/

	
* (1) Set file path ------------------------------------------------------------

	// Inherited from script [master]

	
* (2) Baseline characteristics -------------------------------------------------

	* (2.1) Clean the sample and create officers / units variables
	
	
		* (2.1.1) Open the full sample
		use $master, clear
	
	
		* (2.1.2) Merge in Demographics and charateristics at baseline
		drop unit
		gen id = employee_id
		merge 1:1 id using $demo, nogen keep(1 3) keepusing(unit pers_yoj)

			* Demographics: 
			* The coded values, e.g., "WHI" and "BLK" is as per CPD's standard
			gen dmale 		= (sex == 1)
			gen dwhite 		= inlist(race, "WHI")
			gen dblack 		= inlist(race, "BLK")
			gen dhispanic 	= inlist(race, "WWH", "WBH", "S")
			gen dboth 		= inlist(race, "BLK", "WWH", "WBH", "S")
			gen dother 		= !inlist(race, "WWH", "WBH", "BLK", "WHI", "S")
		
			* Experience:
			* Years of experience in CPD, measured in full year values
			gen experience = pers_yoj
	
	
		* (2.1.3) Merge census data at district level
		destring unit, gen(district) 
		// The column "unit" in Sit-D masterfile is district at randomization

			/* For officers on a multiple district basic (outside of 1-25),
			   We create a corresponding district code for them, as below
			   This is to support in the merging of Census data, which is coded
			   similarly 
			*/ 
			
			replace district = 101 	if 	inlist(district, 211, 311)
			replace district = 102 	if 	inlist(district, 212, 312)
			replace district = 103 	if 	inlist(district, 213, 313)
			replace district = 1000 if  inlist(district, 193)
			
			* Merge Census data on district level
			merge m:1 district using $census, nogen keep(1 3)
			
			
		* (2.1.4) Ratio of treatment officers within each strata
		bys strata: gen total = _N						// Total in strata
		bys strata: egen treat_count = sum(treatment)	// Treatment in strata
		gen ratio = treat_count / total					// Ratio
		
			/* This is "exogeneous" strata: it represent the unit-watch that the
			officers were assigned to at the time of randomization */
		
	
	
* (3) Expand the data to officer-month level -----------------------------------

	/* Up to this point, the data is at officer level: each officer has one row
   Here, we expand the data to be at officer-month level, for each month
   from 2017-2022 */
	
	* (3.1) Expand the data by year - one row for each year
	expand 6
	bys employee_id: gen year = 2016 + _n

	* (3.2) Expand the data by month - one row for each month
	expand 12
	bys employee_id year: gen year_month = year * 100 + _n
	bys employee_id year: gen month = _n
			
	* (3.3) Check the data form - we should have 2,070 rows for each month
	tab year_month treatment
	drop year month
	
	// 1011C and 1059T officers for each year-month
	
	/*Note on "year_month":
	This is an important variable that is used throughout the pipeline, it takes
	the form "YYYYMM" as in 202001 for January 2020. This variable is later used
	to identify the months before and after Sit-D training
	*/

		
				
* (4) Merge in administrative data ---------------------------------------------
	
	* (4.1) Merge in administrative data. 
	merge m:1 employee_id year_month using $admin_data, nogen
	
		/* Note on this data
		
		** This data contain officer outcome at the officer-month level. The 
		outcomes included in this data set are:
		- Uses of force
		- Arrests
		- Other officer activities (components) 
		- Complaints and commendation 
		- Unit/Watch where officer spend time each month
		
		** Arrest data contains CPD arrests. Each arrest is categorized into
		categories using the FBI code associated with top-line charge.
		For detailed description of what each category entails, see
		the attached "fbi_cat.csv"	
		
		** In the subsequent scripts, reader may encounter variable names with 
		suffix "_C" which stand for raw count.
		
		** For a detailed description of the variable name, see
		Analysis Script - Latex table
	
	
		*/
		
	
* (5) Merge in randomized "training period dates" ------------------------------

	* Draw attendance from the "date" and specification file (quite large)
	merge m:1 employee_id using $att_file, nogen keep(1 3) ///
	keepusing( *${spec}* trained)
	
		/*Note:
		The "att" dataset contains the training-period and post-training month 
		The mechanics of this randomization is as follows:
		
		1) For ALL officer: Define the first month of training period as March of 2020
		
		2) For Treatment officers: 
		Define the first month of training period as March of 2020
		
		3) For Control officers
		- Define the last month of training period:

			+ For control officers: 
			    | First sort the officer by officer ID
				| Each officer is given a randomized number from the 
				  uniform distribution, with seed number "60610"
				| The officers are sorted by this randomized number
			    | Using this order, assigned enough number of control officers
				  to a given calendar month such that the proportion
				  of control officers "finished" training in that month
				  match the proportion of that month in the treatment group
		*/
			
			
			
* (6) Indicate the pre-training, post-training, during-training periods --------

	* (6.1) Looping through the set of specification and identify the periods
		
		/* This part of the code just create a set of variable that indicate
		whether an officer-month record belong to the pre-training period, the
		training period, and the post-training period. This part also create 
		the interaction term betwen the post training period and the treatment 
		indicator */
			
			* Post period: months after training end
			gen post_$spec		= (year_month > $spec_training_ended)
			
			* Pre period: months before training start
			gen pre_$spec		= (year_month < $spec_training_started)
			
			* Training period
			gen training_$spec	= inrange(year_month, ///
								$spec_training_started, $spec_training_ended)
											
			* Interaction term for {Treatment * Post-training}
			gen treat_post_$spec = treatment * post_$spec
	


	* Identify missingness
	gen missing = (trr_C == .)
		
		/* Since our data is not imputed with zero for officers with absolute no outcome
		The total TRR count (or any other outcome) 
		is an indicator for missingness */

		

	
	
			
	
	
	
* Save -------------------------------------------------------------------------

	sort employee_id year_month
	save $output1, replace

	